Abstract

The long short-term memory (LSTM) neural network is capable of
processing complex sequential information because it uses special
gating schemes to learn representations from long input sequences.
It has the potential to model any sequential time-series data in
which the current hidden state has to be considered in the context
of the past hidden states. This property makes LSTM an ideal choice
for learning the complex dynamics of various actions. Unfortunately,
conventional LSTMs do not consider the impact of the spatio-temporal
dynamics corresponding to the given salient motion patterns when
they gate the information that ought to be memorized through time.
To address this problem, we propose a differential gating scheme for
the LSTM neural network, which emphasizes the change in information
gain caused by the salient motions between successive frames. This
change in information gain is quantified by the Derivative of States
(DoS), and thus the proposed LSTM model is termed the differential
Recurrent Neural Network (dRNN). We demonstrate the effectiveness of
the proposed model by automatically recognizing actions from
real-world 2D and 3D human action datasets. Our study is one of the
first works towards demonstrating the potential of learning complex
time-series representations via high-order derivatives of states.

1. Introduction 

Recently, Recurrent Neural Networks (RNNs) [25], especially the Long
Short-Term Memory (LSTM) model [12], have gained significant
attention in solving many challenging problems involving sequential
time-series data, such as action recognition [10, 6, 11],
multilingual machine translation [27, 3], multimodal translation
between videos and sentences [30], and robot control [19]. In these
applications, learning an appropriate representation of sequences
is an important step in achieving artificial intelligence.

Compared with many existing spatio-temporal features [15, 26]
extracted from time-series data, RNNs use either a hidden layer [25]
or a memory cell [12] to learn the time-evolving states that model
the underlying dynamics of the input sequence. For example, [1, 6]
have used LSTM to model video sequences and learn their long
short-term dynamics. In contrast to the conventional RNN, the major
component of LSTM is the memory cell, which is modulated by three
gates: the input, output and forget gates. These gates determine the
amount of dynamic information entering or leaving the memory cell.
The memory cell has a set of internal states, which store the
information obtained over time. In this context, these internal
states constitute a representation of an input sequence learned
over time.

In many recent works, LSTMs have shown tremendous potential in
action recognition tasks [1, 6, 11]. The existing LSTM model
represents a video by integrating over time all the available
information from each frame. However, we observed that for an action
recognition task, not all frames contain salient spatio-temporal
information that discriminates between different classes of actions.
Many frames contain non-salient motions which are irrelevant to
the performed action.

This inspired us to develop a new family of LSTM models that
automatically learn the dynamic saliency of the actions performed.
The conventional LSTM fails to learn the salient dynamic patterns
comprehensively, since the gate units do not explicitly consider
whether a frame contains salient motion information when they
modulate the memory cells. Thus the model is insensitive to the
dynamic evolution of the hidden states given the input video
sequences. To address this problem, we propose the differential RNN
(dRNN) model that learns these salient spatio-temporal
representations of actions.

Specifically, dRNN models the dynamics of actions by computing
different orders of Derivative of States (DoS) that are sensitive to
the spatio-temporal structure of actions. Depending on the DoS, the
gate units can learn the appropriate information that is required to
model the dynamic evolution of actions. To train the dRNN model, we
use the truncated Back Propagation algorithm to prevent exploding or
diminishing errors through time [12]. In particular, we follow the
rule that the errors propagated through the connections to the DoS
nodes are truncated once they leave the current memory cell.

Finally, we demonstrate that dRNNs can achieve state-of-the-art
performance on both 2D and 3D action recognition datasets.
Specifically, dRNNs outperform the existing LSTM model on these
action recognition tasks, consistently achieving better performance
with the same input sequences. Moreover, when compared with other
algorithms that are tailored to special assumptions on the
spatio-temporal structure of actions, the proposed general-purpose
dRNN model still reaches competitive performance.

The remainder of this paper is organized as follows. In Section 2,
we review several works related to the action recognition problem.
The background and details of RNNs and LSTMs are reviewed in
Section 3. Section 4 presents the proposed differential RNNs
(dRNNs). The experimental results are presented in Section 5.
Finally, we conclude and discuss the future work related to dRNNs
in Section 6.

2. Related Work 

Action recognition has been a long-standing research problem in the
computer vision and pattern recognition community. It aims to enable
a computer to automatically understand the activities performed by
people interacting with the surrounding environment and with each
other [21]. This is a challenging problem due to the huge
intra-class variance of actions performed by different actors at
various speeds and in diverse environments (e.g., camera angles,
lighting conditions, and cluttered background).

To address this problem, many robust spatio-temporal
representations have been constructed. For example, HOG3D [15] uses
the histogram of 3D gradient orientations to represent the motion
structure over the frame sequences; 3D-SIFT [26] extends the popular
SIFT descriptor to characterize the scale-invariant spatio-temporal
structure of a 3D video volume; actionlet ensemble [31] utilizes a
robust approach to model the discriminative features from 3D
positions of the tracked joints captured by depth cameras.


Although these descriptors have achieved remarkable success, they
are usually engineered to model a specific spatio-temporal structure
in an ad-hoc fashion. Recently, the huge success of deep networks in
image classification [16] and speech recognition [9] has inspired
many researchers to apply deep neural networks, such as 3D
Convolutional Neural Networks (3DCNNs) [2] and Recurrent Neural
Networks (RNNs) [1, 6], to action recognition. In particular, [2]
developed a 3D convolutional neural network that extends the
conventional CNN by taking a space-time volume as input. In
contrast, [1, 6] used LSTMs to represent the video sequences
directly, and modeled the dynamic evolution of the action states via
a sequence of memory cells.

Meanwhile, some existing approaches combine deep neural networks
with spatio-temporal descriptors, achieving competitive
performance. For example, in [2], an LSTM model takes a sequence of
Harris3D and 3DCNN descriptors extracted from each frame as input,
and its result on the KTH dataset has shown state-of-the-art
performance.

3. Background 

In this section, we briefly review the recurrent neural network as
well as its variant, the long short-term memory model. Readers who
are familiar with them may skip directly to the next section.

3.1. Recurrent Neural Networks 

Traditional recurrent neural networks (RNNs) [25] model the
dynamics of an input sequence of frames
$\{x_t \in \mathbb{R}^m \mid t = 1, \ldots, T\}$ through a sequence
of hidden states $\{s_t \in \mathbb{R}^n \mid t = 1, \ldots, T\}$,
thereby learning the spatio-temporal structure of the input
sequence. For example, a classical RNN model uses the following
recurrent equation

$$s_t = \tanh(W_{ss} s_{t-1} + W_{sx} x_t + b_s) \tag{1}$$

to model the hidden state $s_t$ at time $t$ by combining the
information from the current input $x_t$ and the past hidden state
$s_{t-1}$, where the hyperbolic tangent $\tanh(\cdot)$ is an
activation function with range $[-1, 1]$, $W_{ss}$ and $W_{sx}$ are
two mapping matrices to the hidden state, and $b_s$ is the bias
vector.

The hidden state can be mapped to an output sequence
$\{z_t \in \mathbb{R}^k \mid t = 1, \ldots, T\}$ as

$$z_t = \tanh(W_{zs} s_t + b_z) \tag{2}$$

where each $z_t$ represents a 1-of-$k$ encoding of the confidence
scores on $k$ classes of actions. Then, this output vector can be
transformed to a vector of probabilities $y_t$ by the softmax
function as

$$y_{t,c} = \frac{\exp(z_{t,c})}{\sum_{l=1}^{k} \exp(z_{t,l})} \tag{3}$$

with each entry $y_{t,c}$ being the probability of frame $t$
belonging to class $c \in \{1, \ldots, k\}$.
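As an illustration of Eqs. (1)-(3), the following is a minimal NumPy sketch of the classical RNN forward pass. It is not the authors' implementation: the function names, the zero initial state, the random toy weights and the sizes `T, m, n, k` are assumptions chosen only to make the example runnable.

```python
import numpy as np

def softmax(z):
    """Softmax of Eq. (3), shifted by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, W_ss, W_sx, b_s, W_zs, b_z):
    """Apply Eqs. (1)-(3) to a sequence xs of shape (T, m)."""
    s = np.zeros(W_ss.shape[0])  # assumption: zero initial hidden state
    states, probs = [], []
    for x_t in xs:
        s = np.tanh(W_ss @ s + W_sx @ x_t + b_s)  # Eq. (1)
        z = np.tanh(W_zs @ s + b_z)               # Eq. (2)
        states.append(s)
        probs.append(softmax(z))                  # Eq. (3)
    return np.stack(states), np.stack(probs)

# Toy sizes (hypothetical, far smaller than those used in the experiments).
rng = np.random.default_rng(0)
T, m, n, k = 5, 4, 3, 6
xs = rng.normal(size=(T, m))
states, probs = rnn_forward(
    xs,
    W_ss=rng.normal(scale=0.1, size=(n, n)),
    W_sx=rng.normal(scale=0.1, size=(n, m)),
    b_s=np.zeros(n),
    W_zs=rng.normal(scale=0.1, size=(k, n)),
    b_z=np.zeros(k),
)
print(probs.sum(axis=1))  # each row of probabilities sums to one
```

Each row of `probs` is the per-frame class distribution $y_t$; the hidden states stay within $[-1, 1]$ because of the tanh activation.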

3.2. Long Short-Term Memory 

The above classical RNN is limited in learning the long-term
representation of video sequences, due to the exponential decay in
retaining the context information of video frames [12]. To overcome
this limitation, Long Short-Term Memory (LSTM) [12], a variant of
RNN, has been designed to learn the long-range dependency between
the output label and the input frame, and it has achieved
competitive performance on action recognition tasks [1, 2].

In particular, LSTMs are composed of a sequence of memory cells,
each containing an internal state $s_t$ storing the memory of the
input sequence up to time $t$. To store the memory with respect to a
context over a long period of time, three types of gate units are
incorporated into LSTMs to control what information enters and
leaves the memory cell over time [12]. These gate units are
activated by a nonlinear function of the input/output sequences as
well as the internal states, making them powerful enough to model
the dynamically changing context, given that human actions evolve at
various time scales.

Formally, an LSTM cell has the following gates:

1. The input gate $i_t$ controls the degree to which the input
information enters the memory cell to influence its internal state
$s_t$ at time $t$. The activation of this gate has the following
recurrent form

$$i_t = \sigma(W_{is} s_{t-1} + W_{iz} z_{t-1} + W_{ix} x_t + b_i)$$

where the sigmoid $\sigma(\cdot)$ is an activation function with the
range $[0, 1]$, with 0 meaning the gate is closed and 1 meaning the
gate is completely open; $W_{i*}$ are the mapping matrices and $b_i$
is the bias vector.

2. The forget gate $f_t$ modulates the previous state $s_{t-1}$ to
control its contribution to the current state (cf. Eq. (4)). It is
defined as

$$f_t = \sigma(W_{fs} s_{t-1} + W_{fz} z_{t-1} + W_{fx} x_t + b_f)$$

with the mapping matrices $W_{f*}$ and the bias vector $b_f$.

With the input and forget gate units, the internal state of each
memory cell can be updated as follows:

$$s_t = f_t \odot s_{t-1} + i_t \odot s_{t-\frac{1}{2}} \tag{4}$$

where we define the pre-state $s_{t-\frac{1}{2}}$ as

$$s_{t-\frac{1}{2}} = \tanh(W_{sz} z_{t-1} + W_{sx} x_t + b_s).$$

The pre-state can be considered as an intermediate state between
two consecutive frames, aggregating the information from the last
output $z_{t-1}$ and the current input $x_t$. Then it is combined
with the gated information from the previous state $s_{t-1}$ to
update the current state $s_t$ as in Eq. (4).

3. The output gate $o_t$:

$$o_t = \sigma(W_{os} s_t + W_{oz} z_{t-1} + W_{ox} x_t + b_o).$$

It gates the information output from a memory cell, which
influences the future states of LSTM cells. Then the output of a
memory cell can be expressed as

$$z_t = o_t \odot \tanh(W_{zs} s_t + b_z) \tag{5}$$

where $\odot$ stands for the element-wise product.

In brief, LSTM proceeds by iteratively applying Eq. (4) and Eq. (5)
to update the state $s_t$ and output $z_t$. In this process, the
forget gate, output gate and input gate play a critical role in
controlling the information entering and leaving the memory cell.
More details about LSTMs can be found in [12].
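The update order described in this section can be sketched in NumPy as follows. This is a hedged illustration rather than the reference implementation of [12]: the helper names `init_lstm` and `lstm_step`, the dictionary keys (which mirror the weight subscripts in the text), the random initialization, and the assumption that the output gate has the same dimension $k$ as the output $z_t$ are all choices made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_lstm(m, n, k, rng, scale=0.1):
    """Random toy parameters; keys mirror the subscripts used in the text."""
    shapes = {
        "W_is": (n, n), "W_iz": (n, k), "W_ix": (n, m),
        "W_fs": (n, n), "W_fz": (n, k), "W_fx": (n, m),
        "W_sz": (n, k), "W_sx": (n, m),
        "W_os": (k, n), "W_oz": (k, k), "W_ox": (k, m),
        "W_zs": (k, n),
    }
    p = {name: rng.normal(scale=scale, size=shape)
         for name, shape in shapes.items()}
    p.update(b_i=np.zeros(n), b_f=np.zeros(n), b_s=np.zeros(n),
             b_o=np.zeros(k), b_z=np.zeros(k))
    return p

def lstm_step(x_t, s_prev, z_prev, p):
    """One memory-cell update: gates, then Eq. (4), then Eq. (5)."""
    i_t = sigmoid(p["W_is"] @ s_prev + p["W_iz"] @ z_prev
                  + p["W_ix"] @ x_t + p["b_i"])           # input gate
    f_t = sigmoid(p["W_fs"] @ s_prev + p["W_fz"] @ z_prev
                  + p["W_fx"] @ x_t + p["b_f"])           # forget gate
    pre = np.tanh(p["W_sz"] @ z_prev + p["W_sx"] @ x_t + p["b_s"])  # pre-state
    s_t = f_t * s_prev + i_t * pre                        # Eq. (4)
    o_t = sigmoid(p["W_os"] @ s_t + p["W_oz"] @ z_prev
                  + p["W_ox"] @ x_t + p["b_o"])           # output gate
    z_t = o_t * np.tanh(p["W_zs"] @ s_t + p["b_z"])       # Eq. (5)
    return s_t, z_t

rng = np.random.default_rng(0)
m, n, k = 4, 3, 6                      # toy sizes (hypothetical)
p = init_lstm(m, n, k, rng)
s, z = np.zeros(n), np.zeros(k)        # assumption: zero initial state/output
for x_t in rng.normal(size=(5, m)):
    s, z = lstm_step(x_t, s, z, p)
```

Because the output gate lies in $(0, 1)$ and tanh in $(-1, 1)$, every entry of `z` stays strictly inside $(-1, 1)$.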


4. Differential Recurrent Neural Networks 


For an action recognition task, not all video frames contain
salient patterns to discriminate between different classes of
actions. Many spatio-temporal descriptors, such as 3D-SIFT [26] and
HoGHoF [17], have been proposed to localize and encode the salient
spatio-temporal points. They detect and encode the spatio-temporal
points related to salient motions of the objects in video frames,
revealing the important dynamics of actions.

In this paper, we develop a novel LSTM model to automatically learn
the dynamics of actions, by detecting and integrating the salient
spatio-temporal sequences. The conventional LSTMs might fail to
capture these salient dynamic patterns, because the gate units do
not explicitly consider the impact of dynamic structures present in
input sequences. This makes the model inadequate to learn the
evolution of action states. To address this problem, we propose a
differential RNN (dRNN) model to learn and integrate the dynamics of
actions.

The proposed dRNN model is based on the observation that the
internal state of each memory cell contains the accumulated
information about the spatio-temporal structure, i.e., it is a long
short-term representation of an input sequence. So the Derivative of
States (DoS) $\frac{ds_t}{dt}$ quantifies the change of information
at each time $t$. In other words, a large magnitude of DoS is an
indicator of a salient spatio-temporal structure containing the
informative dynamics caused by an abrupt change of action state. In
this case, the gate units should allow more information to enter the
memory cell to update its internal state. Otherwise, when the
magnitude of DoS is small, the incoming information should be gated
out of the memory cell so the internal state would not be affected
by the current input. Therefore, DoS should be used as a factor to
gate the information flow into and out of the internal state of the
memory cell over time.

Figure 1. Architecture of the proposed dRNN model at time $t$. In
the memory cell, the input gate $i_t$ and the forget gate $f_t$ are
controlled by the DoS at $t-1$, and the output gate $o_t$ is
controlled by the DoS at $t$.

Moreover, we can involve higher orders of DoS
$\{\frac{d^{(n)} s_t}{dt^{(n)}} \mid n \geq 2\}$ to detect and
capture the higher-order dynamic patterns for the dRNN model. For
example, when modeling a moving object in a video, the first-order
DoS captures its velocity while the second-order DoS captures its
acceleration. These different orders of DoS will enable dRNN to
better represent the dynamic evolution of action states.

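Figure 1 illustrates the architecture of the proposed dRNN model.
Formally, we have the following recurrent equations to control the
gate units with the DoS up to order $N$:

$$i_t = \sigma\Big(\sum_{n=0}^{N} W_{id}^{(n)} \frac{d^{(n)} s_{t-1}}{dt^{(n)}} + W_{iz} z_{t-1} + W_{ix} x_t + b_i\Big) \tag{6}$$

$$f_t = \sigma\Big(\sum_{n=0}^{N} W_{fd}^{(n)} \frac{d^{(n)} s_{t-1}}{dt^{(n)}} + W_{fz} z_{t-1} + W_{fx} x_t + b_f\Big) \tag{7}$$

$$o_t = \sigma\Big(\sum_{n=0}^{N} W_{od}^{(n)} \frac{d^{(n)} s_t}{dt^{(n)}} + W_{oz} z_{t-1} + W_{ox} x_t + b_o\Big) \tag{8}$$

where $\frac{d^{(n)} s_t}{dt^{(n)}}$ is the $n$-order DoS, and
$W_{\cdot d}^{(n)}$ are the corresponding mapping matrices.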

Finally, it is worth pointing out that we do not use the derivative
of inputs as a measurement of salient dynamics to control the gate
units. The derivative of inputs would amplify the unwanted noise
that is often contained in the input sequence. Moreover, the
derivative of inputs only represents the local dynamic saliency, in
contrast to the long short-term change in the information gained
over time. For example, a motion may have been performed several
frames ago. Using the derivative of inputs would treat it as a novel
salient motion, even though it has already been stored by the LSTM.
In contrast, DoS does not have this problem, because the internal
state $s_t$ has long-term memory of the past motion pattern and
therefore does not respond to a motion that had previously occurred
as if it were new.


4.1. Discretized Model 

Since the model is defined in the discrete-time domain, the
first-order derivative $\frac{ds_t}{dt}$, as the velocity of
information change, can be discretized as the difference of states

$$v_t = \frac{ds_t}{dt} \approx s_t - s_{t-1} \tag{9}$$

for its simplicity [7].

Similarly, the second-order DoS, which can be considered as the
acceleration of information change, can be discretized as

$$a_t = \frac{d^2 s_t}{dt^2} \approx v_t - v_{t-1} = s_t - 2 s_{t-1} + s_{t-2} \tag{10}$$

In this paper, we only consider the first two orders of DoS.
Higher orders can be derived in a similar way.
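A small numerical example may help make Eqs. (9) and (10) concrete. The sketch below computes both orders of discretized DoS for a given sequence of states; the function name `discretized_dos` and the choice to treat states before the first time step as zero are assumptions of this example, not details given in the paper.

```python
import numpy as np

def discretized_dos(states):
    """First- and second-order DoS for states of shape (T, n).

    Assumption: states before the first time step are zero, which
    produces a boundary effect in the first entries of v and a.
    """
    padded = np.vstack([np.zeros((2, states.shape[1])), states])
    v = padded[2:] - padded[1:-1]                        # Eq. (9)
    a = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]    # Eq. (10)
    return v, a

# One-dimensional toy state following s_t = t^2 for t = 0, 1, 2, 3.
states = np.array([[0.0], [1.0], [4.0], [9.0]])
v, a = discretized_dos(states)
print(v.ravel())  # [0. 1. 3. 5.]
print(a.ravel())  # [0. 1. 2. 2.]
```

Away from the boundary, the second-order DoS of the quadratic sequence is the constant 2, matching the continuous second derivative of $t^2$, and `a` equals the difference of consecutive entries of `v` as stated in Eq. (10).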

With the above recurrent equations, at time step $t$, the dRNN
model proceeds in the following order.

• Compute input gate activation $i_t$ and forget gate activation
$f_t$ by Eq. (6) and Eq. (7);

• Update state $s_t$ with $i_t$ and $f_t$ by Eq. (4);

• Compute discretized DoS
$\{\frac{d^{(n)} s_t}{dt^{(n)}} \mid n = 0, \ldots, N\}$ up to order
$N$ at time $t$, e.g. Eq. (9) and Eq. (10);

• Compute output gate $o_t$ by Eq. (8);

• Output $z_t$ gated by $o_t$ from memory cell by Eq. (5);

• (Optional) Output the label $y_t$ by applying the softmax to
$z_t$ by Eq. (3).


Now it is obvious that this model is termed differential 
RNNs (dRNNs) because of the central role of derivatives of 
states in detecting and capturing the salient spatio-temporal 
structures. 
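The per-step procedure listed above can be sketched as a single NumPy function for $N = 2$. This is an illustrative sketch under stated assumptions, not the authors' code: the names `init_drnn`, `dos`, `gate` and `drnn_step`, the zero initial history, the random toy weights, and the choice of dimension $k$ for the output gate are all hypothetical. The order-0 term of the sums in Eqs. (6)-(8) is taken to be the state itself.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def init_drnn(m, n, k, rng, order=2, scale=0.1):
    """Random toy parameters; one DoS matrix per order 0..order."""
    w = lambda *shape: rng.normal(scale=scale, size=shape)
    return {
        "W_id": [w(n, n) for _ in range(order + 1)],
        "W_fd": [w(n, n) for _ in range(order + 1)],
        "W_od": [w(k, n) for _ in range(order + 1)],
        "W_iz": w(n, k), "W_ix": w(n, m), "b_i": np.zeros(n),
        "W_fz": w(n, k), "W_fx": w(n, m), "b_f": np.zeros(n),
        "W_sz": w(n, k), "W_sx": w(n, m), "b_s": np.zeros(n),
        "W_oz": w(k, k), "W_ox": w(k, m), "b_o": np.zeros(k),
        "W_zs": w(k, n), "b_z": np.zeros(k),
    }

def dos(hist):
    """DoS of orders 0, 1, 2 at the newest state of hist (three states)."""
    s_2, s_1, s_0 = hist
    return [s_0, s_0 - s_1, s_0 - 2.0 * s_1 + s_2]   # Eqs. (9), (10)

def gate(W_d, d, W_z, z_prev, W_x, x_t, b):
    """Shared form of Eqs. (6)-(8): sigmoid of DoS, output and input terms."""
    return sigmoid(sum(W @ d_n for W, d_n in zip(W_d, d))
                   + W_z @ z_prev + W_x @ x_t + b)

def drnn_step(x_t, hist, z_prev, p):
    """hist holds [s_{t-3}, s_{t-2}, s_{t-1}]; returns new history and z_t."""
    d_prev = dos(hist)                                            # DoS at t-1
    i_t = gate(p["W_id"], d_prev, p["W_iz"], z_prev, p["W_ix"], x_t, p["b_i"])
    f_t = gate(p["W_fd"], d_prev, p["W_fz"], z_prev, p["W_fx"], x_t, p["b_f"])
    pre = np.tanh(p["W_sz"] @ z_prev + p["W_sx"] @ x_t + p["b_s"])
    s_t = f_t * hist[-1] + i_t * pre                              # Eq. (4)
    hist = [hist[1], hist[2], s_t]
    o_t = gate(p["W_od"], dos(hist), p["W_oz"], z_prev,
               p["W_ox"], x_t, p["b_o"])                          # Eq. (8)
    z_t = o_t * np.tanh(p["W_zs"] @ s_t + p["b_z"])               # Eq. (5)
    return hist, z_t

rng = np.random.default_rng(0)
m, n, k = 4, 3, 6                           # toy sizes (hypothetical)
p = init_drnn(m, n, k, rng)
hist, z = [np.zeros(n)] * 3, np.zeros(k)    # assumption: zero initial history
for x_t in rng.normal(size=(5, m)):
    hist, z = drnn_step(x_t, hist, z, p)
y = softmax(z)                              # optional label distribution, Eq. (3)
```

Keeping only the last three states is sufficient here because the second-order difference in Eq. (10) never looks further back than $s_{t-2}$.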


4.2. Learning Algorithm 

To learn the model parameters of dRNNs, we define a loss function
to measure the deviation between the target class $c_t$ and $y_t$ at
time $t$:

$$\ell(y_t, c_t) = -\log y_{t,c_t}.$$

For an action recognition task, the label of an action is often
given at the video level. Since LSTMs have the ability to memorize
the content of an entire sequence, the last memory cell of LSTMs
ought to contain all the necessary information for action
recognition. Thus, for a sequence of length $T$ and a given training
label $c$, the dRNNs can be trained by minimizing the loss at time
$T$, i.e., $\ell(y_T, c) = -\log y_{T,c}$.

Otherwise, if an individual label $c_t$ is given to each frame $t$
in the sequence, we can minimize the cumulative loss over the
sequence:

$$\sum_{t=1}^{T} \ell(y_t, c_t).$$
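Both forms of the loss can be written in a few lines of NumPy. The sketch below is illustrative only; the function names and the convention of passing either a single integer (video-level label) or a list of integers (frame-level labels) are assumptions of this example.

```python
import numpy as np

def frame_loss(y_t, c_t):
    """Negative log-probability of the target class at one frame."""
    return -np.log(y_t[c_t])

def sequence_loss(ys, labels):
    """Video-level label (scalar): loss at the last frame only.
    Frame-level labels (sequence): cumulative loss over all frames."""
    if np.isscalar(labels):
        return frame_loss(ys[-1], labels)
    return sum(frame_loss(y_t, c_t) for y_t, c_t in zip(ys, labels))

# Toy per-frame class probabilities for T = 3 frames and k = 2 classes.
ys = np.array([[0.5, 0.5], [0.25, 0.75], [0.1, 0.9]])
video_loss = sequence_loss(ys, 1)            # uses only the last frame
frame_total = sequence_loss(ys, [0, 1, 1])   # sums over all three frames
```

With a video-level label only the final output contributes to the loss, so the gradient reaches earlier frames solely through the recurrent connections.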

Both types of loss functions can be minimized by Back Propagation
Through Time (BPTT) [4], which unfolds a dRNN model over several
time steps and then runs the back propagation algorithm to train the
model. To prevent the back-propagated errors from decaying or
exploding exponentially, LSTMs usually use truncated BPTT [12]. The
idea is rather simple: once the back-propagated error leaves the
memory cell or gates, it will not be allowed to enter the memory
cell again. In the proposed dRNNs, we also use the truncated errors
to learn the model parameters. In particular, we do not allow the
errors to re-enter the memory cell once they leave it through the
DoS nodes $v_t$ and $a_t$.

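Formally, we assume the following truncated derivatives of gate
activations:

$$\frac{\partial i_t}{\partial v_{t-1}} \doteq 0, \quad \frac{\partial f_t}{\partial v_{t-1}} \doteq 0, \quad \frac{\partial o_t}{\partial v_t} \doteq 0$$

and

$$\frac{\partial i_t}{\partial a_{t-1}} \doteq 0, \quad \frac{\partial f_t}{\partial a_{t-1}} \doteq 0, \quad \frac{\partial o_t}{\partial a_t} \doteq 0$$

where $\doteq$ stands for the truncated derivative. The details
about the implementation of truncated BPTT can be found in [12].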


5. Experiments and Results 

We compare the performance of the proposed method with the
state-of-the-art LSTM and non-LSTM methods in the existing
literature on both 2D and 3D human action datasets.

5.1. Datasets 

The proposed method is evaluated on the KTH 2D action recognition
dataset, as well as the MSR Action3D dataset.

KTH dataset. We choose the KTH dataset [24] because it is a de
facto benchmark for evaluating action recognition algorithms. This
makes it possible to compare directly with other algorithms. There
are two KTH datasets, KTH-1 and KTH-2, which both consist of six
action classes: walking, jogging, running, boxing, hand-waving and
hand-clapping. The actions are performed several times by 25
subjects in four different scenarios: outdoors, outdoors with scale
variation, outdoors with different clothes, and indoors. The
sequences are captured over a homogeneous background with a static
camera recording 25 frames per second. Each video has a resolution
of 160 × 120, and lasts for about 4 seconds in the KTH-1 dataset and
for about one second in the KTH-2 dataset. There are 599 videos in
the KTH-1 dataset and 2,391 video sequences in the KTH-2 dataset.

MSR Action3D dataset. The MSR Action3D dataset [18] consists of 567
depth map sequences performed by 10 subjects, captured using a depth
sensor similar to the Kinect device. The resolution of each video is
320 × 240 and there are 20 action classes, where each subject
performs each action two or three times. The actions are chosen in
the context of gaming. They cover a variety of movements related to
arms, legs, torso, etc. This dataset has a lot of noise in the joint
locations of the skeleton as well as high intra-class variations and
inter-class similarities, making it a challenging dataset for
evaluation among the existing 3D datasets. We follow an experimental
setting similar to that of [31], where half of the subjects are used
for training and the other half are used for testing. This setting
is much more challenging than the subset one used in [18], because
all actions are evaluated together and the chance of confusion is
much higher.

5.2. Feature Extraction 

We use densely sampled HOG3D features to represent each frame of
the video sequences from the KTH dataset. Specifically, we uniformly
divide the 3D video volumes into a dense grid, and extract the
descriptors from each cell of the grid. The parameters for HOG3D are
the same as those used in [15]. We extract HOG3D features using the
standard KTH-optimized dense sampling parameters mentioned on the
authors' webpage.^1 The size of the descriptor is 1000 per grid
cell, and there are 56 such cells in each frame, yielding a
56,000-dimensional feature vector per frame. We apply PCA to reduce
the dimension to 450, retaining 97% of the energy among the
principal components, to construct a compact input into the dRNN
model.
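The dimensionality reduction step can be illustrated with a small SVD-based PCA that keeps the fewest components whose cumulative variance reaches a target energy. This is a sketch under assumptions: the function name `pca_reduce` and the synthetic low-rank toy data are hypothetical, and the paper does not state which PCA implementation was used.

```python
import numpy as np

def pca_reduce(X, energy=0.97):
    """Project rows of X onto the fewest principal components whose
    cumulative variance ratio reaches `energy`."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions; S**2 is proportional to variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(S ** 2) / np.sum(S ** 2)
    d = int(np.searchsorted(ratio, energy)) + 1   # first index reaching energy
    components = Vt[:d]
    return Xc @ components.T, components, mean

# Synthetic stand-in for per-frame descriptors: 200 frames, 50 dimensions,
# generated from 5 latent factors plus small noise (toy data only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))
Z, components, mean = pca_reduce(X, energy=0.97)
print(Z.shape)
```

On the real features the same rule would be applied to the 56,000-dimensional HOG3D vectors, for which the text reports 450 retained dimensions; a new frame `x` would be projected as `(x - mean) @ components.T`.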

For the 3D action dataset, MSR Action3D, a depth sensor like Kinect
provides an estimate of the 3D joint coordinates of the body
skeleton, and the following features were extracted to represent the
MSR Action3D depth sequences: (1) Position: 3D coordinates of the 20
joints obtained from the skeleton map. These 3D coordinates were
then concatenated, resulting in a 60 dimensional feature per frame;
(2) Angle: normalized pair-wise angles. The normalized pair-wise
angles were obtained from 18 joints of the skeleton map. The two
feet joints were not included. This resulted in a 136 dimensional
feature vector per frame; (3) Offset: offset of the 3D joint
positions between the current and the previous frame [34]. These
offset features were also computed using the 18 joints from the
skeleton map, resulting in a 54 dimensional feature per frame; (4)
Velocity: histogram of the velocity components obtained from the
point cloud. This feature was computed using the 18 joints as in the
previous cases, resulting in a 162 dimensional feature per frame;
(5) Pairwise joint distances: The 3D coordinates obtained from the
skeleton map were used to compute pairwise joint distances with the
centre of the skeleton map, resulting in a 60 dimensional feature
vector per frame. For the following experiments, these five
different features were concatenated to result in a 583 dimensional
feature vector per frame.

^1 http://lear.inrialpes.fr/people/klaeser/software.

Dataset                    KTH    MSR Action3D
-----------------------    ---    ------------
Input Units                450    583
Memory Cell State Units    300    400
Output Units               6      20

Table 1. Architecture of the dRNN model used on two datasets.
Each row shows the number of units in each component. For the
sake of fair comparison, we adopt the same architecture for the
dRNN models of both orders on two datasets.

5.3. Architecture and Training 

The architectures of the dRNN models trained on the two datasets
are shown in Table 1. For the sake of fair comparison, we adopt the
same architecture for the dRNN models of both orders on the two
datasets. We can see that the number of memory cell units is smaller
than the number of input units on both datasets. This can be
interpreted as follows. The sequence of an action video often forms
a continuous trajectory embedded in a low-dimensional manifold of
the input space. Thus, a lower-dimensional state space suffices to
capture the dynamics of such a trajectory.

We plot the learning curve for training the model on the KTH
dataset in Figure 2. The learning rate of the BPTT algorithm is set
to 0.0001. The figure shows that the objective loss continuously
decreases over 50 epochs. Usually after 40 epochs, the training of
the dRNN model begins to converge.

5.4. Results on KTH Dataset 

There are several different evaluation protocols used on the KTH
dataset in the literature. This can result in differences in
performance as large as 9% across different experimental protocols,
as reported in [8]. For the sake of fair comparison, we follow the
cross-validation protocol of [2], in which we randomly select 16
subjects to train the model, and test over the remaining 9 subjects.
The performance is reported as the average across five such trials.
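The subject-wise protocol above can be sketched as follows. The helper name `subject_splits`, the numbering of subjects from 1 to 25, and the random seed are assumptions made for illustration; the actual random partitions used in the experiments are not given in the text.

```python
import numpy as np

def subject_splits(n_subjects=25, n_train=16, n_trials=5, seed=0):
    """Draw random train/test partitions of subject IDs, one per trial."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_trials):
        perm = rng.permutation(n_subjects) + 1      # subject IDs 1..n_subjects
        train = np.sort(perm[:n_train])
        test = np.sort(perm[n_train:])
        splits.append((train, test))
    return splits

splits = subject_splits()
for trial, (train, test) in enumerate(splits, start=1):
    print(f"trial {trial}: {len(train)} training subjects, "
          f"{len(test)} test subjects")

# The reported figure would then be the mean of the per-trial accuracies,
# e.g. np.mean(accuracies) for a list of five accuracy values.
```

Splitting by subject rather than by video ensures that no person appears in both the training and the test set of a trial.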

First, we compare the dRNN model with the conventional LSTM model
in Table 2. Here we report the cross-validation accuracy on both
KTH-1 and KTH-2 datasets. In addition, Figure 4 shows the confusion
matrix obtained by the 2-order dRNN model on the KTH-1 dataset. This
confusion matrix is computed by averaging over five trials in the
above cross-validation protocol. The performance of the conventional
LSTM has been reported in the literature [11, 2]. We note that these
reported accuracies often vary with different types of features.
Thus, a fair comparison between different models can only be made
with the same type of input feature.

Figure 3. The curve of the 1st and 2nd orders of DoS over an
example of sequence for the action "boxing." Note that the local
maximum of DoS corresponds to the change from punching to relaxing.

For the dRNN model, we report the accuracy with up to the 2-order
DoS. The table shows that with the same HOG3D feature, the proposed
dRNN models outperform the conventional LSTM model, and the 2-order
dRNN yields a better accuracy than its 1-order counterpart. Although
higher orders of DoS might improve the accuracy further, we do not
report the result, since it is straightforward to add more orders of
DoS into dRNN, and the improved performance might not compensate for
the increased computational cost. For most practical applications,
the first two orders of DoS should be sufficient.

Baccouche et al. [2] reported an accuracy of 94.39% and 92.17% on
the KTH-1 and KTH-2 data sets, respectively. But it is worth noting
that they used a combination of 3DCNN and LSTM, where 3DCNN plays
the crucial role in reaching such performance. In fact, the 3DCNN
model alone can reach an accuracy of 91.04% and 89.40% on the KTH-1
and KTH-2 data sets as reported in [2]. In contrast, they reported
that the LSTM with the Harris3D feature only achieved 87.78% on
KTH-2, as compared with the 92.12% accuracy obtained by the 2-order
dRNN with the HOG3D feature. In Table 2, under a fair comparison
with the same HOG3D feature, the dRNN models of both orders
outperform their LSTM counterpart.

Figure 4. Confusion Matrix on the KTH-1 dataset obtained by the
2-Order dRNN model.

In Figure 3, to support our motivation of learning LSTM
representations based on the dynamic change of states evolving over
frames, we illustrate some example frames of the "boxing" action
versus the curve of the $L_2$-norm of the 1-order and 2-order DoS on
the KTH dataset. It shows the change from "punching" to "relaxing"
at the local maximum of DoS, showing the ability of the dRNN model
to capture the salient dynamics of the action.

We also show the performance of the other non-LSTM state-of-the-art
approaches in Table 3. Many of these compared algorithms focus on
the action recognition problem, relying on special assumptions about
the spatio-temporal structure of actions. They might not be
applicable to other types of sequences which do not satisfy these
assumptions. In contrast, the proposed dRNN model is a
general-purpose model that is not tailored to a specific type of
action sequence. This also makes it competent on the 3D action
recognition task, as we will show below.

5.5. Results on MSR Action3D Dataset 

Table 4 compares the results on the MSR Action3D dataset, and
Figure 5 shows the confusion matrix obtained by the 2-order dRNN
model. The results are obtained by following exactly the same
experimental setting as in [31], in which half of the actor subjects
are used for training and the rest are used for testing. This is in
contrast to another evaluation protocol in the literature [18] that
splits the 20 action classes into three subsets and performs the
evaluation within each individual subset. The evaluation protocol we
adopt is more challenging because it is evaluated over all 20 action
classes with no common subjects in the training and test sets.

Dataset    Method                    Accuracy (%)
-------    ----------------------    ------------
KTH-1      LSTM + HOF [11]           90.7
           LSTM + HOG3D              89.93
           1-order dRNN + HOG3D      93.28
           2-order dRNN + HOG3D      93.96
KTH-2      LSTM + Harris3D [2]       87.78
           LSTM + HOG3D              87.32
           1-order dRNN + HOG3D      91.98
           2-order dRNN + HOG3D      92.12

Table 2. Cross-validation accuracy over five trials obtained by
the proposed dRNN model in comparison with the conventional
LSTM model on KTH-1 and KTH-2 data sets.

Dataset    Method                    Accuracy (%)
-------    ----------------------    ------------
KTH-1      Rodriguez et al. [22]     81.50
           Jhuang et al. [13]        91.70
           Schindler et al. [23]     92.70
           3DCNN [2]                 91.04
KTH-2      Ji et al. [14]            90.20
           Taylor et al. [28]        90.0
           Laptev et al. [17]        91.80
           Dollar et al. [5]         81.20
           3DCNN [2]                 89.40

Table 3. Cross-validation accuracy over five trials obtained by the
other compared algorithms on KTH-1 and KTH-2 datasets.

Method                       Accuracy (%)
-------------------------    ------------
Actionlet Ensemble [31]      88.20
HON4D [20]                   88.89
DCSF [32]                    89.3
Lie Group [29]               89.48
LSTM                         87.78
1-order dRNN                 91.40
2-order dRNN                 92.03

Table 4. Comparison of the dRNN model with the other algorithms
on MSR Action3D dataset.

From the results, the dRNN models of both orders outperform the
conventional LSTM algorithm with the same feature. Also, both dRNN
models perform competitively as compared with the other algorithms.
We notice that the Super Normal Vector (SNV) model [33] has reported
an accuracy of 93.09% on the MSR Action3D dataset. However, this
model is based on a special assumption about the 3D geometric
structure of the surfaces of depth image sequences. Thus, this
approach is a very specialized model for solving the 3D action
recognition problem. This is in contrast to dRNN, which is a general
model without any specific assumptions on the dynamic structure of
the video sequences.

Figure 5. Confusion Matrix on the MSR Action3D dataset by the
2-Order dRNN model.

In brief, through the experiments on both 2D and 3D human action
datasets, we show the competitive performance of dRNN compared with
both LSTM and non-LSTM models. This demonstrates its wide
applicability in representing and modeling the dynamics of both 2D
and 3D action sequences, irrespective of any assumptions on the
structure of video sequences.

6. Conclusion and Future Work 

In this paper, we present a new family of differential Recurrent
Neural Networks (dRNNs) that extend the Long Short-Term Memory
(LSTM) structure by modeling the dynamics of states evolving over
time. The new structure is better at learning the salient
spatio-temporal structure. Its gate units are controlled by the
different orders of derivatives of states, making the dRNN model
more adequate for the representation of the long short-term dynamics
of actions. Experimental results on both 2D and 3D human action
datasets demonstrate that the dRNN model outperforms the
conventional LSTM model. Even in comparison with the other
state-of-the-art approaches based on strong assumptions about the
motion structure of the actions being studied, the general-purpose
dRNN model still demonstrates very competitive performance on both
2D and 3D datasets. In future work, we will test dRNN in combination
with more sophisticated input feature sequences to explore the
specific motion structure of actions.